
feat: Data quality measures #420

Draft
wants to merge 2 commits into master

Conversation

@nickevansuk (Contributor) commented on Apr 5, 2023

This PR introduces data quality measures, based on validator results.

Measures are defined by “exclusions”, which are references to specific types of validator errors. When a measure is calculated, an item is counted towards the measure's total unless it is “excluded” by one of the validator errors referenced by its “exclusions”.

For example:

{
  name: 'Has a name',
  description: 'The name of the opportunity is essential for a participant to understand what the activity is',
  exclusions: [
    {
      errorType: [
        ValidationErrorType.MISSING_REQUIRED_FIELD,
      ],
      targetFields: {
        Event: ['name'],
        FacilityUse: ['name'],
        IndividualFacilityUse: ['name'],
        CourseInstance: ['name'],
        EventSeries: ['name'],
        HeadlineEvent: ['name'],
        SessionSeries: ['name'],
        Course: ['name'],
      },
    },
  ],
},

In the above measure, an item will not be counted towards the total and percentage if the validator error “MISSING_REQUIRED_FIELD” is present for the target field “name”.
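For illustration, here is a minimal sketch of how a measure could be evaluated against per-item validator results under this model. The function name, the shape of the per-item results, and the error properties (error.type, error.field) are illustrative assumptions, not the PR's actual API:

// Hypothetical sketch: count items that are not "excluded" by any matching validator error.
function evaluateMeasure(measure, itemsWithErrors) {
  let counted = 0;
  for (const { item, errors } of itemsWithErrors) {
    // An item is excluded if any error matches an exclusion's errorType
    // and one of the exclusion's targetFields for the item's @type.
    const excluded = measure.exclusions.some((exclusion) => errors.some((error) => (
      exclusion.errorType.includes(error.type)
      && (exclusion.targetFields[item['@type']] || []).includes(error.field)
    )));
    if (!excluded) counted += 1;
  }
  const total = itemsWithErrors.length;
  return {
    name: measure.name,
    count: counted,
    total,
    percentage: total === 0 ? 0 : Math.round((counted / total) * 100),
  };
}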

The advantage of this approach is that the complex inheritance rules respected by the validator are implicitly taken into account, and that more complex validation rules such as activity list matching are easily included without any duplicated logic. Tests can also easily be written for complex rules, as the validator already provides a framework for this.

This increases maintainability, flexibility, and consistency of results across tools. The approach is also extensible, and encourages the creation of new data quality rules in the validator as data quality measures become more in-depth: this has the advantage of surfacing errors at a more detailed level within the various OA tools, as well as providing a high-level summary.

Measures are defined within “profiles”, which allows subsets of measures to be defined distinctly for different use cases (e.g. accessibility).
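As a rough sketch, a profile might simply group a named subset of measures; the object shape and names below are illustrative assumptions rather than the PR's actual structure:

// Hypothetical profile grouping measures for a specific use case.
const accessibilityProfile = {
  name: 'Accessibility',
  measures: [
    hasNameMeasure,                  // e.g. the 'Has a name' measure above
    hasAccessibilitySupportMeasure,  // illustrative only
  ],
};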

Measures are defined within this repository, so that they can be used within both the Validator GUI and the Test Suite, and be maintained alongside the validation rules on which they depend.

(Note that this PR is in draft, and requires some refactoring and tidying up before merging)

Screenshot of unstyled results below:
[Screenshot 2023-04-05 at 09 52 21]

Open questions:

  • How would we ideally display this visually to users? (The output is a simple mustache template; rough design spreadsheet here)
  • Do we need to think about combining parent and child in the feed within the test suite for a more accurate assessment of e.g. the url? (Less relevant for the current measures, which are mostly based on required fields)

@howaskew commented on Apr 5, 2023

Here's an example output from my work via the visualiser...

[Screenshot 2023-04-05 at 12 38 37]

The idea is a simple, intuitive, visual summary of the smaller set of DQ metrics discussed at W3C. It's a stepping stone into the detail in the validator report.

@nickevansuk (Contributor, Author) commented:
@howaskew looks great! Postcode validation is a great example of a rule that would be helpful in the validator too (centralising logic, etc.).

It's cool having it visible in the visualiser, as data users might be browsing feeds there. I'm thinking about whether setting up the validator to build as a lightweight client-side library might give us the best of both worlds: centralising logic while still having the view in the visualiser...

Or, even easier, we could just store nightly DQ reports, embed them in a tab on the visualiser, and reference them on the status page. That might be even better: one pre-cached source of truth.
